Focused Crawler Based on Improved Algorithm of Web Content Similarity

doi:10.3969/j.issn.1006-2475.2011.09.001

Computer and Modernization, 2011, Vol. 193, Issue (9): 1-4. doi: 10.3969/j.issn.1006-2475.2011.09.001

• Algorithm Design and Analysis •

WEI Jing-jing¹, YANG Ding-da², LIAO Xiang-wen²

1. Department of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China; 2. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China

Received: 2011-05-06   Revised: 1900-01-01   Online: 2011-09-22   Published: 2011-09-22

Abstract

The focused crawler is an important component of a vertical search engine. The Web content relevance algorithm of a traditional focused crawler considers only term frequency and ignores the position of key terms within the page. After analyzing focused crawlers based on Web content relevance, this paper proposes an improved method of calculating relevance that uses the features of HTML tags. Experimental results show that the improved algorithm achieves an average accuracy of 64.99%, an increase of 15.37% over the original method.
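The general idea of weighting terms by the HTML tag they appear in can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the tag weights or the exact formula, so the `TAG_WEIGHTS` values, the `relevance` function, and the use of cosine similarity over a vector space model are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights (assumption): the abstract does not state the real values.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "a": 1.2}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Counts terms, scaling each occurrence by the heaviest enclosing tag."""

    def __init__(self, tag_weights):
        super().__init__()
        self.tag_weights = tag_weights
        self.stack = []           # currently open tags
        self.vector = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max(
            (self.tag_weights.get(t, DEFAULT_WEIGHT) for t in self.stack),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.vector[term] += weight


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def relevance(html, topic_vector, tag_weights=None):
    """Relevance of a page to a topic; pass tag_weights={} for plain term frequency."""
    weights = TAG_WEIGHTS if tag_weights is None else tag_weights
    parser = WeightedTermCounter(weights)
    parser.feed(html)
    parser.close()
    return cosine(parser.vector, topic_vector)


topic = {"focused": 1.0, "crawler": 1.0}
page_a = "<html><head><title>focused crawler</title></head><body><p>cooking recipes and travel</p></body></html>"
page_b = "<html><head><title>cooking recipes</title></head><body><p>focused crawler and travel</p></body></html>"

print(relevance(page_a, topic) > relevance(page_b, topic))
print(relevance(page_a, topic, tag_weights={}), relevance(page_b, topic, tag_weights={}))
```

Both sample pages contain the same terms with the same frequencies, so plain term frequency scores them identically. With tag weighting, the page that carries the topic terms in its title scores higher, which is the kind of location information the abstract says the traditional method ignores.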

Key words: search engine, focused crawler, similarity, vector space model, HTML tags

CLC Number: TP301.6

WEI Jing-jing, YANG Ding-da, LIAO Xiang-wen. Focused Crawler Based on Improved Algorithm of Web Content Similarity[J]. Computer and Modernization, 2011, 193(9): 1-4.

